Ethernet has learned a new trick in the past couple of years. It has learned how to scale its
link speed by a factor of ten. 10BASE-T begot 100BASE-T, which in turn will be scaled
up to one gigabit per second with 1000BASE-T. The Ethernet community has deliberately
and emphatically rejected the idea of specifying speeds in anything other than multiples
of 10. Proposals [...] speeds between 10 and 100 Mbps; and they have recently been made
again in the context of Gigabit Ethernet, and they have been met with the same reception.
No one who sells networks or com[...] interoperable intermediate link speeds, and everyone
seems to feel more comfortable when the link speed can be expressed as powers of the
number of our toes.

Nevertheless, at any given point in time, the choice of link speeds (10, 100, 1000 Mbps)
may not match up very well with the amount of sustained throughput that a particular
device can support. Virtually all multi-processor servers shipped today can sustain greater
than 100 Mbps aggregate network transfer rates. Furthermore, when switches and high
performance routers are used to interconnect multiple links of a given speed, there is a
clear need for the inter-switch or inter-router link to be able to support at least some
aggregation of the links. The next power of ten increase in link speed may not be an
attractive choice from a cost standpoint unless the utilization of the higher speed link is
going to be greater than 40 to 50 percent. Kicking up the speed also requires new hardware.

To solve this problem, the concept of "trunking" has long been used in some networks. For
the purposes of this discussion, trunking will be defined as the ability to combine multiple
parallel physical links into one logical link. We limit ourselves to trunks in which the
physical links share a common source and a common destination. We further limit ourselves
to trunks in which each of the links (or "segments" of the trunk) have identical physical
layer and media access control layer characteristics. We have paid particular attention to
the way IP packets will be transported over these trunks, but we don't believe that the
model described herein is in any way limited to IP.

The remainder of this paper will describe a set of rules that the equipment on each end of
the trunk must agree to. The rules can be applied to either end stations (DTEs, such as
computers of any classification) or network infrastructure components (specifically
switches). The rules are symmetric, which is to say that both ends are subject to the same
rules.



Simple Trunking Model (STruM)



August 29, 1996 






2. Rules 

1. A trunk may have any number of segments, but all segments must have identical
physical layer and media access control layer characteristics.

2. Each segment of the trunk shares a common source and a common destination with the
other segments of the trunk.

3. Temporal ordering of the packets transported across a given segment of the trunk must
be preserved throughout the network, subject only to loss due to bit errors.

4. Temporal ordering of the packets transported across different segments of the trunk
must not be assumed.

5. Packets must not be replicated or duplicated across the segments of a trunk. This
includes broadcast and multicast packets.

6. Broadcast and multicast packets transmitted through a segment of the trunk must not be
"echoed" or "looped-back" to the sender over the other segments of the trunk.

7. The model assumes full duplex operation at the physical and media access control
layers for each segment. Half duplex operation, using CSMA/CD, is neither supported nor
desired.

8. End stations connected to trunks will associate a single 48 bit IEEE MAC address with
all segments of the trunk.

9. Load balancing across the segments is not assumed to be perfect. Each end of the trunk
will attempt to load balance across the segments to the best of its ability, subject to all
of the foregoing rules.

These rules do not address the configuration, setup, or maintenance of trunks, nor do they
address failure detection or recovery. It is assumed that the physical layer will provide
some indication if a particular segment of the trunk fails, and that each end of the trunk
will monitor whatever status is provided by the physical layer, and take whatever action
is deemed appropriate. Configuration and setup are assumed to be performed via manual
operations specific to each implementation.

3. Proposals for load balancing

The diagrams in Figure 1 through Figure 3 may assist the reader in understanding the
proposals.

Figure 1 shows a trunk connection between a server (as a specific example of a DTE) and
a switch. The example shows a trunk with 3 segments, labeled A, B, and C. The switch is
also connected to several client segments, labeled a, b, c, etc. In this configuration, it is
possible to replace the server with other types of equipment, such as a router, or a high
performance workstation, or a printer, to list a few examples.
FIGURE 1. Trunks between servers and switches

Figure 2 shows a trunk used as a connection between two switches. As in the previous
example, there is no special significance to the number of segments which make up the
trunk in Figure 2. The trunk could just as easily be made of two or four or practically any
number of segments, depending on the amount [...]
FIGURE 2. Trunks between switches

Figure 3 shows a trunk employed between two servers. Once again, either or both of the
servers could be replaced with some other type of equipment, such as a router.
FIGURE 3. Trunks between servers






3.1 Load balancing in servers 



An important goal of the model is to make the trunk appear like a single high bandwidth
interface from the viewpoint of the server's protocol stack. In addition, all clients which
communicate with the server via the trunk should have a consistent view of the server's
identity (MAC and IP host addresses). Load balancing by managing the Address Resolution
Protocol (ARP) tables in the clients has been considered and rejected because it
would dramatically increase the number of ARP frames emitted by the server, and would
require a significant amount of added functionality in the server's ARP implementation.

In order to satisfy Rule #4, "Temporal ordering of the packets transported across different
segments of the trunk must not be assumed", the server must ensure that all packets of any
sequence of packets which requires temporal ordering are transmitted over the same
segment of the trunk.

It should be noted that transport protocols generally can recover from situations where
packets arrive out of order, but that this generally entails a significant degradation in
throughput, because out of order reception is handled as an exception, and is not
optimised. Therefore, the server load balancing mechanism should be designed to take
advantage of Rule #3, "Temporal ordering of the packets transported across a given segment of
the trunk must be preserved throughout the network, subject only to loss due to bit errors".

At this point, it might be helpful to introduce a diagram which shows the software compo- 
nents of a trunked server interface. 
FIGURE 4. Software components of a trunked server interface

A "trunking pseudo driver" is introduced between the IP protocol layer and the network
device driver. The function of the pseudo driver is to act as a demultiplexor in the transmit
path, and a multiplexor in the receive path. In order to satisfy the rules regarding temporal
ordering, the pseudo driver will attempt to ensure that all of the packets associated with a
particular transport layer datagram are enqueued on the same network device transmit
queue. This assumption is made on the basis that ordering within a datagram is sufficient,
and that ordering between datagrams is unnecessary.

It would be quite difficult for the pseudo driver code to inspect the headers of each packet
and attempt to associate them with a particular datagram, though one can imagine that this
could be accomplished given an arbitrarily fast processor to execute the code. As a first
approximation, the authors feel that it would be sufficient to keep a small cache of MAC
(or IP) destination addresses associated with each network interface (each segment of the
trunk). Thus, when IP hands the pseudo driver a packet, the pseudo driver checks the
cache to see if it has recently transmitted a packet to this DA. If it has, the pseudo driver
will enqueue the packet on the same interface that it enqueued the last packet to this DA. If
this DA has not been transmitted to recently, the pseudo driver can either enqueue the
packet on the least busy transmit queue (the emptiest queue), or the next available queue in
a round robin fashion. Whatever queue it selects, the driver must update the cache for that
queue with the new DA.
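
The cache lookup described above might be sketched as follows. This is only an
illustration under stated assumptions, not the authors' implementation: the class name
TrunkPseudoDriver, the transmit method, and the least-recently-used eviction policy are
all hypothetical, and "least busy" is approximated by the shortest transmit queue.

```python
from collections import OrderedDict, deque


class TrunkPseudoDriver:
    """Hypothetical sketch of the transmit-side destination-address cache."""

    def __init__(self, num_segments, cache_depth):
        # One transmit queue and one small DA cache per segment.
        self.queues = [deque() for _ in range(num_segments)]
        self.caches = [OrderedDict() for _ in range(num_segments)]
        self.cache_depth = cache_depth

    def _least_busy(self):
        # Assumption: "least busy" means the shortest transmit queue.
        return min(range(len(self.queues)), key=lambda i: len(self.queues[i]))

    def transmit(self, dest_addr, packet):
        # Reuse the segment that last carried traffic to this DA, if cached.
        for seg, cache in enumerate(self.caches):
            if dest_addr in cache:
                cache.move_to_end(dest_addr)  # mark as recently used
                break
        else:
            # DA not seen recently: pick a segment and record the DA there.
            seg = self._least_busy()
            cache = self.caches[seg]
            cache[dest_addr] = True
            if len(cache) > self.cache_depth:
                cache.popitem(last=False)  # assumed policy: evict oldest DA
        self.queues[seg].append(packet)
        return seg
```

With three segments, successive packets to one destination stay on one segment, while a
new destination is steered to an emptier queue. A round robin choice could be substituted
for _least_busy without changing the rest of the sketch.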

In the degenerate case in which a server is talking to one and only one client, this
technique would ensure that all packets to that client travel over the same interface, and hence
the same segment of the trunk. Why do we call this load balancing? Because it works
much better in a non-degenerate case, and will do a good job of ensuring ordered delivery
if we can safely assume that the degree of interleaving of packets to different DAs between
IP and the network driver is of the same order as the number of processors in a given
server. As a first guess, the depth of the cache should be equal to roughly twice the number
of processors in a given server. The deeper the cache, the more casual the updating to the
cache can be. Experimentation to derive the optimal value for the depth of the cache, and
to explore the trade-offs between caching layer 2 and layer 3 addresses is warranted.

3.2 Load balancing in switches 

Several switch load balancing mechanisms are possible for forwarding packets into a
trunk. The set of load balancing guidelines listed below apply to both switch to switch and
switch to server trunks. They ensure that the switch behavior is consistent with
conventional bridging guidelines:

1. No frame misordering for a given priority level between a given MAC source and
destination.

2. No frame duplication.

3. Transparent to protocols operating above the MAC layer.

A natural approach to load balancing is to emulate a faster link by keeping the segments
equally busy, possibly by using the corresponding output queue as the metric for
how busy the segment is. As long as the links implement flow control, the output queue
length is a good end to end proxy for the segment utilization (without flow control, a high
segment load is not necessarily reflected in the state of the output queue due to packet loss
on the receive queue at the other end).

Deciding for every packet which segment to use, based solely on queue length, might lead
to the frame misordering prohibited by the first guideline. If the decision is only a function
of the source address of the packet, or of the packet's port of arrival, then the first guideline
is always satisfied. This scheme however results in a static load balancing function, and
the trunk effectiveness depends on the distribution of the traffic sources. While we
anticipate that a large number of traffic sources would result in acceptably even distributions, it is
still possible to end up with configurations where the mapping function forwards most of
the traffic to the same segment.

To handle these cases, it is possible to have a dynamic mapping function and still maintain
frame ordering, as long as the function changes are slower than the output queue transit
times. For instance, the mapping for a given source address can be determined at the time
the first packet with the source address is seen, and eventually aged when the source
address is not seen for a period of time.
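
A dynamic mapping of this kind might look like the sketch below. It is an illustration
only: the class SegmentMapper, its methods, and the max_idle parameter are hypothetical
names, and choosing the shortest output queue on first sighting is one assumed policy
among several. For frame ordering to hold, max_idle would have to be longer than the
worst-case output queue transit time.

```python
class SegmentMapper:
    """Hypothetical sketch: pin each source address to one trunk segment."""

    def __init__(self, num_segments, max_idle):
        self.queue_len = [0] * num_segments  # output queue depth per segment
        self.max_idle = max_idle             # aging time, same unit as `now`
        self.table = {}                      # source -> (segment, last_seen)

    def select(self, source, now):
        entry = self.table.get(source)
        if entry is not None and now - entry[1] <= self.max_idle:
            segment = entry[0]               # mapping still fresh: reuse it
        else:
            # First sighting, or entry aged out: pick the shortest queue.
            segment = min(range(len(self.queue_len)),
                          key=self.queue_len.__getitem__)
        self.table[source] = (segment, now)
        self.queue_len[segment] += 1
        return segment

    def dequeue(self, segment):
        # Called when a frame leaves the segment's output queue.
        if self.queue_len[segment] > 0:
            self.queue_len[segment] -= 1
```

A source that keeps sending stays on its segment regardless of queue lengths; only after
it has been silent for longer than max_idle can it be remapped to a less loaded segment.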

Having the mapping function consider both the source address and the port of arrival
reduces the number of pathological cases. For example if the traffic is spatially dominated by
a particular input port, considering the source address helps spread its traffic, and
conversely the port of arrival helps distribute traffic dominated by a small number of
addresses (senders or routers), in particular if more than one trunk exists in the switch.
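
One way to combine the two inputs in a static mapping is to hash them together. The
sketch below is an assumption for illustration: the function name static_segment and the
use of CRC-32 are not taken from this paper, and any deterministic hash would serve.

```python
import zlib


def static_segment(source_mac: bytes, arrival_port: int, num_segments: int) -> int:
    """Hypothetical static mapping from (source MAC, arrival port) to a segment."""
    # Concatenate the 48 bit source address with a 16 bit port number so that
    # both inputs influence the result (assumes 0 <= arrival_port < 65536).
    key = source_mac + arrival_port.to_bytes(2, "big")
    # CRC-32 is an assumed choice of hash; the result is deterministic, so
    # frames from one source on one port always map to the same segment.
    return zlib.crc32(key) % num_segments
```

Because the result depends only on the source address and the port of arrival, frame
ordering for a given source is preserved, at the cost of the static behavior discussed
above.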

Prevention of frame duplication is achieved by treating the set of trunked ports as if they
were a single port, with a separate queue per segment, and making sure that forwarding
to the trunk is done to only one of its queues.

Furthermore, for the purposes of other 802.1d functions like learning MAC addresses,
filtering frames, and executing the Spanning Tree Protocol (if applicable), trunked ports are
also treated as if they were a single port.
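
Treating a trunk as a single logical port when flooding a frame might be sketched as
follows. The function flood, the ports dictionary, and the pick_segment callback are
hypothetical names introduced for illustration; the sketch assumes an ordinary port is
represented by one queue and a trunk by one queue per segment.

```python
def flood(frame, arrival_port, ports, pick_segment):
    """Hypothetical sketch: flood a frame, treating each trunk as one port.

    `ports` maps a logical port name to its list of segment queues; an
    ordinary port has one queue, a trunk has one queue per segment.
    `pick_segment(frame, n)` returns an index in range(n).
    """
    for name, queues in ports.items():
        if name == arrival_port:
            continue  # never send a frame back out the port it arrived on
        # Exactly one queue of each logical port receives the frame.
        queues[pick_segment(frame, len(queues))].append(frame)
```

A broadcast arriving on a client port is placed on exactly one segment queue of the
trunk, and a broadcast arriving on any segment of the trunk is not sent back over the
other segments.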

So far the discussion was centered around using MAC layer information for load
balancing. It is possible for the switch to observe higher level protocol information in order to
make better load balancing decisions, as long as the third rule of protocol transparency is
followed. Transparency implies that the protocols are neither aware of nor explicitly
cooperate with the switch load balancing function. In addition, for protocols that are not
supported or understood by the switch, connectivity must still be guaranteed.

Load balancing based on higher level information is practical for switches that examine
Layer 3 headers on a packet by packet basis. Many switches examine Layer 3 headers
once, for VLAN configuration for example, and use the corresponding Layer 2
information for packet processing. The potential load balancing merits of this approach were not
considered.

3.2.1 Other Approaches 

The aim of the mapping function could also be different than equally balancing the
segments. For example it could separate traffic according to priority, or whether the traffic is
bandwidth managed or best effort. A priority based approach is supported by the first
guideline, because packet order preservation is not necessary across different priorities. A
priority based approach is straightforward whenever the priority information is well
defined at the MAC level (for example if VLAN tags are used).

Restating the main observations, we have shown that the switch behavior is conceptually
simple, guided by a set of simple rules along with the particular switch architecture, and
can be defined independently of the server load balancing.